Introduction to Data Analytics

Ziyuan Huang

Last Updated: November 13, 2025

Welcome to ANLY 500

What is Analytics? - Possible Definition 1

What is Analytics? - Possible Definition 2

Scope of Analytics?

What is Descriptive Analytics? (1)

An Example of what to Expect in Descriptive Analytics: Ex.1.1

library(datasets)
data("sunspot.month") # Load embedded dataset
head(sunspot.month, 10)
##        Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct
## 1749  96.7 104.3 116.7  92.8 141.7 139.2 158.0 110.5 126.5 125.8

An Example of what to Expect in Descriptive Analytics: Ex.1.1

str(sunspot.month)
##  Time-Series [1:3310] from 1749 to 2025: 96.7 104.3 116.7 92.8 141.7 ...

An Example of what to Expect in Descriptive Analytics: Ex.1.1

summary(sunspot.month)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   24.20   68.00   81.99  122.70  398.20

An Example of what to Expect in Descriptive Analytics: Ex.1.1

library(ggplot2)
library(dplyr)

# Convert to tibble with time index
sunspot_df <- tibble(
  time = seq_along(sunspot.month),
  sunspots = as.numeric(sunspot.month)
)

ggplot(sunspot_df, aes(x = time, y = sunspots)) + 
  geom_point(alpha = 0.5, color = "steelblue") + 
  labs(
    y = "Number of Sunspots",
    x = "Time (Months since 1749)",
    title = "Historical Sunspot Activity"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))
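
A complementary descriptive summary is to collapse the monthly series into one mean per calendar year, which makes the roughly 11-year solar cycle easier to see than in the raw scatter. The following is a minimal sketch, not part of the original example; the names monthly, first_year, year and annual_mean are introduced here for illustration, and the code assumes the series starts in January (as start(sunspot.month) reports).

```r
library(datasets)

# Monthly series as a plain numeric vector
monthly <- as.numeric(sunspot.month)

# Year label for each observation; assumes the series starts in January
first_year <- start(sunspot.month)[1]
year <- first_year + (seq_along(monthly) - 1) %/% 12

# Mean monthly sunspot number within each calendar year
# (the final year may be based on fewer than 12 months)
annual_mean <- tapply(monthly, year, mean)

head(round(annual_mean, 1))
```

The integer division avoids relying on floating-point values from time(), at the cost of assuming a January start.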

What is Predictive Analytics?

What is Predictive Analytics?

An Example of what to Expect in Predictive Analytics: Ex.2.1

library(quantmod)
library(lubridate)

# Define date range (last 5 years)
start_date <- Sys.Date() - years(5)
end_date <- Sys.Date() - days(2)

# Fetch stock data from Yahoo Finance, suppressing download warnings
suppressWarnings(
  getSymbols("AMZN", src = "yahoo", from = start_date, to = end_date, auto.assign = TRUE)
)
## [1] "AMZN"
str(AMZN)
## An xts object on 2020-11-13 / 2025-11-10 containing: 
##   Data:    double [1253, 6]
##   Columns: AMZN.Open, AMZN.High, AMZN.Low, AMZN.Close, AMZN.Volume ... with 1 more column
##   Index:   Date [1253] (TZ: "UTC")
##   xts Attributes:
##     $ src    : chr "yahoo"
##     $ updated: POSIXct[1:1], format: "2025-11-13 17:47:41"

An Example of what to Expect in Predictive Analytics: Ex.2.1

# Split data for training (first 1199 observations)
train_size <- min(1199, nrow(AMZN))
train_data <- AMZN[1:train_size, ]

# Build predictive model
predictive_model <- lm(
  formula = AMZN.Close ~ AMZN.High + AMZN.Low + AMZN.Volume, 
  data = train_data
)

summary(predictive_model)
## 
## Call:
## lm(formula = AMZN.Close ~ AMZN.High + AMZN.Low + AMZN.Volume, 
##     data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9513 -0.8685 -0.0221  0.9510  9.3134 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.445e-02  2.484e-01  -0.058    0.954    
## AMZN.High    5.268e-01  2.427e-02  21.701   <2e-16 ***
## AMZN.Low     4.732e-01  2.473e-02  19.137   <2e-16 ***
## AMZN.Volume -8.230e-10  1.845e-09  -0.446    0.656    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.415 on 1195 degrees of freedom
## Multiple R-squared:  0.9985, Adjusted R-squared:  0.9985 
## F-statistic: 2.704e+05 on 3 and 1195 DF,  p-value: < 2.2e-16

An Example of what to Expect in Predictive Analytics: Ex.2.1

# Diagnostic plots for model assessment
old_par <- par(mfrow = c(2, 3), mar = c(4, 4, 2, 1)) # Save previous settings
plot(predictive_model, which = 1, main = "Residuals vs Fitted")
plot(predictive_model, which = 2, main = "Q-Q Plot")
plot(predictive_model, which = 3, main = "Scale-Location")
plot(predictive_model, which = 4, main = "Cook's Distance")
plot(predictive_model, which = 5, main = "Residuals vs Leverage")
par(old_par) # Restore previous layout and margins

An Example of what to Expect in Predictive Analytics: Ex.2.1

# Make predictions on test set
n <- nrow(AMZN)
test_start <- min(1200, n)
test_data <- AMZN[test_start:n, ]

prediction <- predict(predictive_model, newdata = test_data)

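# Display the last six predictions next to the actual closing prices
# as.numeric() drops the xts column name so the column is labelled actual_close
tail(data.frame(
  predicted_close = prediction,
  actual_close = as.numeric(test_data$AMZN.Close)
))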
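
Looking at the last few rows only gives a rough impression of accuracy. A common way to summarise test-set performance in a single number is the root mean squared error (RMSE) or the mean absolute error (MAE). The sketch below is an illustration rather than part of the original example: rmse() and mae() are hypothetical helper functions defined here, and the price vectors are made up so that the block runs without downloading data.

```r
# Hypothetical helpers (not from the original slides)
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae  <- function(actual, predicted) mean(abs(actual - predicted))

# Made-up prices for illustration only; not real AMZN data
actual_close    <- c(100, 102, 101, 105)
predicted_close <- c(101, 101, 103, 104)

rmse(actual_close, predicted_close)  # penalises large misses more heavily
mae(actual_close, predicted_close)   # average size of a miss, in dollars
```

With the objects from the example, the same helpers could be called as rmse(as.numeric(test_data$AMZN.Close), as.numeric(prediction)).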

An Example of what to Expect in Predictive Analytics: Ex.2.1

# Visualize predictions
plot(prediction, 
     type = "l", 
     col = "steelblue",
     lwd = 2,
     main = "Predicted Amazon Closing Prices",
     xlab = "Time Index",
     ylab = "Predicted Close Price ($)",
     las = 1)
grid(col = "gray90")

What is Prescriptive Analytics?

What does this translate into?

What is Data Analytics?

A Subcomponent of Data Analytics is Data Analysis!

A Subcomponent of Data Analytics is Data Analysis!

Other Types of Analysis

How to Correctly Apply Data Analytics?

Breaking Down the Research Process - The Initial Observation

Breaking Down the Research Process - The Initial Observation

Breaking Down the Research Process - The Initial Observation

Breaking Down the Research Process - Generating Theories

Breaking Down the Research Process - Creating a Hypothesis

Breaking Down the Research Process - Testing Theories & Hypotheses

Breaking Down the Research Process - Identifying the Variables

What’s After the Question & Identifying Variables?

What is Data?

Types of Measurements

Categorical Variables

Categorical Levels of Measurement - Binary

Categorical Levels of Measurement - Nominal

Categorical Levels of Measurement - Ordinal

Continuous Variables

Continuous Levels of Measurement - Interval

Continuous Levels of Measurement - Ratio
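
The five levels of measurement named in the slides above map onto R data types fairly directly. The sketch below uses an invented data frame (survey and all of its columns are hypothetical) to show one plausible encoding: unordered factors for binary and nominal variables, an ordered factor for ordinal variables, and plain numerics for interval and ratio variables.

```r
# Hypothetical survey records, invented for illustration
survey <- data.frame(
  smoker       = factor(c("no", "yes", "no", "no")),                    # binary
  major        = factor(c("biology", "finance", "nursing", "finance")), # nominal
  satisfaction = factor(c("low", "high", "medium", "high"),
                        levels = c("low", "medium", "high"),
                        ordered = TRUE),                                # ordinal
  temp_c       = c(21.5, 19.0, 23.2, 20.1),                             # interval
  income_usd   = c(42000, 58000, 0, 61000)                              # ratio
)

str(survey)

# Ordered factors support rank comparisons; unordered factors do not
survey$satisfaction >= "medium"

# Ratios need a true zero: meaningful for income, not for degrees Celsius
survey$income_usd[2] / survey$income_usd[1]
```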

Consider Measurement Error:

How Valid Are My Measures?

Are My Measures Reliable?

Breaking Down the Research Process - Collecting the Data

Cross-Sectional Research

Longitudinal Research

Correlational Research

Experimental Research

Experimental Research - Methods

Experimental Research - Methods

Experimental Research - Methods

Breaking Down the Research Process - Methods to Collect the Data

Types of Variation in the Data to Consider:

Breaking Down the Research Process - Analyzing the Data

Population vs Sample

Fitting Models

Fitting Models

library(dplyr)

# Calculate mean sepal length by species
iris %>%
  group_by(Species) %>%
  summarise(mean_sepal_length = mean(Sepal.Length)) %>%
  knitr::kable(digits = 2, caption = "Mean Sepal Length by Species")

Mean Sepal Length by Species

Species       mean_sepal_length
setosa                     5.01
versicolor                 5.94
virginica                  6.59

Statistical Modeling Parameters

Statistical Modeling Parameters

library(dplyr)

# Set seed for reproducibility
set.seed(123)

# Random sample of 15 observations
iris_sample <- iris %>%
  slice_sample(n = 15)

# Compare sample vs population means
bind_rows(
  iris_sample %>%
    group_by(Species) %>%
    summarise(mean_sepal_length = mean(Sepal.Length)) %>%
    mutate(type = "Sample (n=15)"),
  iris %>%
    group_by(Species) %>%
    summarise(mean_sepal_length = mean(Sepal.Length)) %>%
    mutate(type = "Population (n=150)")
) %>%
  select(type, Species, mean_sepal_length) %>%
  knitr::kable(digits = 3, caption = "Sample vs Population Comparison")

Sample vs Population Comparison

type                  Species       mean_sepal_length
Sample (n=15)         setosa                    4.660
Sample (n=15)         versicolor                5.660
Sample (n=15)         virginica                 6.440
Population (n=150)    setosa                    5.006
Population (n=150)    versicolor                5.936
Population (n=150)    virginica                 6.588
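
A single sample of 15 rows gives only one estimate of each mean. Repeating the sampling many times shows how much such estimates vary and how that variation shrinks with a larger sample. The following is a minimal sketch under the same simplifying assumption as the comparison above, namely that the 150 iris rows are treated as the population; sample_mean(), means_n15 and means_n60 are names introduced here for illustration, and species are pooled to keep the example short.

```r
set.seed(123)

# Treat all 150 iris rows as the population
population_mean <- mean(iris$Sepal.Length)

# Mean sepal length from one random sample of size n (without replacement)
sample_mean <- function(n) mean(sample(iris$Sepal.Length, size = n))

# Repeat the sampling 1000 times for two sample sizes
means_n15 <- replicate(1000, sample_mean(15))
means_n60 <- replicate(1000, sample_mean(60))

population_mean
c(n15 = mean(means_n15), n60 = mean(means_n60)) # both centre on the population mean
c(n15 = sd(means_n15), n60 = sd(means_n60))     # spread shrinks as n grows
```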

Applicable Statistical Models

Summary